arXiv:1504.01639v2 [cs.CV] 8 Jul 2015 





ELSEVIER 


Image and Vision Computing 
journal homepage: www.elsevier.com 


Ego-Object Discovery 

Marc Bolanos^a,**, Petia Radeva^a,b

^a Universitat de Barcelona, Gran Via de les Corts Catalanes, 585, Barcelona 08007, Spain
^b Computer Vision Center, Building O Campus UAB, Bellaterra (Barcelona) 08193, Spain


ABSTRACT 


Lifelogging devices are spreading faster every day. This growth can greatly benefit the development of methods for extracting meaningful information about the user wearing the device and his/her environment. In this paper, we propose a semi-supervised strategy for easily discovering objects relevant to the person wearing a first-person camera. Given an egocentric video or image sequence acquired by the camera, our algorithm uses both the appearance extracted by means of a convolutional neural network and an object refill methodology that allows objects to be discovered even when they appear only a few times in the collection of images. An SVM filtering strategy is applied to deal with the large share of False Positive object candidates produced by most state-of-the-art object detectors. We validate our method on a new egocentric dataset of 4912 daily images acquired by 4 persons, as well as on both the PASCAL 2012 and MSRC datasets. On all of them, we obtain results that largely outperform the state-of-the-art approach. We make public both the EDUB dataset^1 and the algorithm code^2.

© 2015 Elsevier Ltd. All rights reserved. 


1. Introduction 


Ubiquitous computing is more present every day in our lives, and with it lifelogging devices (Hodges et al., 2006; Michael, 2013) are increasing in popularity and spread. By using wearable cameras, we can acquire continuous data about people's lives and build applications that convert this huge amount of data into meaningful information about their lifestyle. Hence, wearable cameras offer an easy way to acquire information about our daily tasks and to extract information about our typical activities and habits (Betancourt et al.) from an egocentric (or first-person) point of view. For example, Fig. 1 shows datasets acquired in three days by 3 different users. We can observe that different persons have different environments. Probably the main reason why the users' datasets can be told apart visually is the distribution and appearance of the scenes, objects and people that appear in them. Following these premises, in this paper we address the problem of automatically discovering the usual objects that form the environment of a person wearing the camera by means of a novel Object Discovery (OD) method. We must note the difference between Object Recognition, where the goal is to discriminate objects according to their classes using a classifier previously trained with a set of training samples; Object Detection, where we should detect the subregion of the image where an object appears; and Object Discovery, where we have to both detect new object instances or concepts and assign them a label, even without having training examples from all possible classes of objects.

Fig. 1. Lifelogging sets from 3 users (every 2 rows correspond to a different user). Note how objects help to discriminate different environments. The annotated objects are to be discovered by the object discovery algorithm.

** Corresponding author: Marc Bolanos. Tel.: +34-669-648-301; e-mail: marc.bolanos@ub.edu
^1 https://www.dropbox.com/s/pySxhalqxz15co3/EDUB%202015.zip?dl=0
^2 https://github.com/MarcBS/Ego-Object_Discovery/releases


1.1. Previous Work 

Several works have previously been done in the OD field, [...] (Russell et al., 2006), others extracting objects by relying on visual words (Russell et al., 2006; Sivic et al., 2005; Liu and Chen, 2007). In (Chatzilari et al., 2011), a semi-supervised method for segmentation-level labeling is presented, and in (Tuytelaars et al., 2010) a comparison of unsupervised OD methods is shown. One of the best performing OD methods is the one published in (Lee and Grauman, 2011), where the authors propose a semi-supervised approach for object discovery. It starts by selecting the easiest objects with an objectness detector and follows an iterative discovery procedure: it clusters the object candidates, selects the best cluster as the one corresponding to the newly discovered object, and applies a One-Class SVM to discover harder instances of it. The authors use a set of low-level image appearance features (texture, colour and shape) and context features. One of its main drawbacks is that these features are not rich enough to capture the characteristics of any existing real-world object. More recently, in (Kading et al., 2015), a method for object discovery relying on active learning was presented. The authors base their work on the assumption that, when dealing with an active learning problem, the oracle does not always know all the classes in advance and that, furthermore, not all the classes are always interesting for the problem at hand. With this in mind, they propose an Expected Model Output Change (EMOC) criterion for selecting the most relevant and useful images to label for the problem they are addressing, while trying to avoid invalid objects by using a local density measure. Cho et al. (2015) worked on part-based object discovery by proposing a new probabilistic matching strategy (Probabilistic Hough Matching) based on HOG descriptors for finding similar objects in different images. Additionally, they propose an associated confidence for finding the most outstanding object in each image.

In egocentric data, object discovery has been studied to a much lesser extent. There, OD brings new challenges due to the non-intentionality of the images: compared to the usual intentional images, the objects and people (if any) usually do not appear in centered positions, and partial occlusions produced by other objects or by the image border are quite frequent. In (Kang et al., 2011), the authors define a method for finding new objects that a person can encounter in their daily living. They start by segmenting the images at different levels, extracting colour, texture and shape information from each segment, and applying a series of grouping and refinement steps to find consistent clusters that can represent new concepts. The authors in (Fathi et al., 2011) develop an object recognition method that uses segmentation techniques for extracting objects from egocentric visual data. In this case, the data is captured using head-mounted cameras with high temporal resolution (about 30 fps), which makes it impossible to record the whole day of the person (due to memory and battery constraints). In order to solve this problem, we use cameras with low temporal resolution (2-3 frames per minute) that are worn at chest level to maximize user comfort. As a result, we obtain a collection of images instead of a video, where objects are captured non-intentionally and frequently appear blurred and non-centred. The main additional challenges these cameras cause are: 1) frames that are so widely spaced in time make it impossible to directly infer information from sequential frames, and 2) the extracted motion information is not reliable enough.


The main handicaps of existing OD methods are: 1) they lack a way to capture and reuse the knowledge acquired when analyzing previous data, which is very important considering the redundancy of the data acquired in lifelogging (Min et al., 2014), and 2) many OD methods rely, as a first step, on an object detection algorithm like (Alexe et al., 2010; Cheng et al., 2014; Arbelaez et al., 2014; Uijlings et al., 2013) to obtain an initial set of object candidates. As we show in Section 3, these methods usually produce a very high number of False Positives (FP) that should be dealt with.


1.2. Contributions 


In this paper, we propose a new OD method for egocentric data (based on our previous work presented in (Bolanos et al., 2015)), which we call Ego-Object Discovery (EOD). Our contributions start with the use of a set of powerful features extracted by means of a Convolutional Neural Network (CNN). These networks are proving their huge potential to address different problems in the field of Computer Vision (Honglak et al., 2009a,b; Goodfellow et al., 2014, just to mention a few). Lately, a new method (Moghimi et al., 2014) using CNN data has been proposed for egocentric activity recognition. However, no OD methods using these features exist yet. To overcome the lack of knowledge reuse present in previous works, we use a new Refill methodology, which allows new samples of a category to be discovered even when it has a low number of instances, a situation that is quite common in egocentric sequences. As additional contributions w.r.t. our previous work, we present here a strategy for dealing with the high number of FPs (or 'No Object' candidates) produced by the object detection methods: an SVM filtering strategy. We also introduce the first egocentric object discovery dataset (EDUB) of lifelogging data with ground truth (GT) object segmentations, compare state-of-the-art object detection algorithms, and analyze the results of our method on two public datasets of intentional images (PASCAL and MSRC).


The article is organized as follows: in Section 2 we define the EOD algorithm. In Section 3 we present the datasets used to validate our method, the tests of EOD on all datasets, a comparison of state-of-the-art object detectors and a discussion of the obtained results. We finish with some conclusions and future work.

2. The Ego-Object Discovery Approach

Given the problem of OD in low-temporal-resolution egocentric data, our algorithm is formulated as an iterative procedure. At the beginning, it must be provided with a seed of initial object information to expand, defined as a small bag of labeled objects, represented by their regions, and called the bag of refill. The EOD algorithm goes through several steps (see Fig. 2): a) it detects image regions representing object candidates, and their corresponding objectness scores, in each new set of images; b) it extracts features from the object candidates by using a pre-trained CNN; c) it filters false object ('No Object') instances; and d) it proceeds with a clustering-based iterative procedure as follows: 1) on the easiest objects, it applies a refill strategy by using the bag of refill, 2) it clusters them by using an agglomerative clustering approach and labels the best cluster, which represents the newly discovered object, and 3) it applies a supervised expansion to find harder instances of it. After a fixed number of iterations, or when no easy sample remains, it outputs the set of found object coordinates and labels.
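The iterative part of the procedure (step d) can be sketched as follows. This is only a minimal illustration, assuming NumPy and scikit-learn; it is not the authors' released code. The function names, the weights `w1` and `w2`, the refill ratio, the number of clusters and the One-Class SVM settings are hypothetical choices, the easy-first threshold follows Eq. (1) of Section 2.2, and the expansion is applied here to all remaining candidates.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_samples
from sklearn.svm import OneClassSVM


def select_easiest(scores, t, w1=1.0, w2=0.05):
    """Indices of candidates whose score exceeds mean + w1*std - w2*t (w1, w2 are assumed values)."""
    scores = np.asarray(scores, dtype=float)
    threshold = scores.mean() + w1 * scores.std() - w2 * t
    return np.flatnonzero(scores > threshold)


def discovery_iteration(features, scores, refill_features, t,
                        refill_ratio=0.5, n_clusters=3, seed=0):
    """One iteration: returns (indices of the best cluster, indices of harder instances)."""
    rng = np.random.default_rng(seed)
    empty = np.array([], dtype=int)
    easy = select_easiest(scores, t)
    if len(easy) <= n_clusters:  # too few easy samples to form clusters
        return empty, empty

    # 1) Refill: complete the easy samples with labelled samples from the bag of refill.
    n_refill = min(len(refill_features), int(refill_ratio * len(easy)))
    picked = rng.choice(len(refill_features), size=n_refill, replace=False)
    data = np.vstack([features[easy], refill_features[picked]])

    # 2) Agglomerative (Ward) clustering; clusters are ranked by the mean
    #    silhouette of their unlabelled (non-refill) members only.
    labels = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward").fit_predict(data)
    easy_labels = labels[:len(easy)]
    unique = np.unique(easy_labels)
    if len(unique) < 2:
        best = unique[0]
    else:
        sil = silhouette_samples(features[easy], easy_labels)
        best = max(unique, key=lambda c: sil[easy_labels == c].mean())
    members = easy[easy_labels == best]

    # 3) Supervised expansion: a One-Class SVM trained on the best cluster
    #    searches the remaining candidates for harder instances.
    rest = np.setdiff1d(np.arange(len(features)), members)
    if len(rest) == 0:
        return members, empty
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(features[members])
    harder = rest[model.predict(features[rest]) == 1]
    return members, harder
```

In a full run, the best cluster would be labelled by the user, its samples added to the bag of refill, and the function called again with `t + 1` on the candidates that remain undiscovered.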

To describe and cluster the candidates, EOD uses both appearance and local context features. Appearance features are extracted by a CNN (Jia, 2013), and context is provided both by the inherent description of the object background, which the CNN also extracts, and indirectly by the refill procedure, which introduces instances of the same classes but with different backgrounds. This is very suitable for lifelogging images, considering the redundancy of the objects we routinely see. In the following subsections, we give details about each step of the EOD procedure.

2.1. Object Candidates Preparation

Object Candidates Generation: The first step needed to characterize the environment of the user through object discovery is extracting a set of object candidates for each image. To do so, we used the Objectness detector provided by Ferrari et al. in (Alexe et al., 2010), which, in addition to the bounding box of each candidate, outputs a score associated with the probability of being a true object (objectness score). This score is produced by three visual cues: Multi-scale Saliency (finds blob-like structures at multiple scales that could indicate the presence of an object); Color Contrast (finds high colour differences between the analyzed bounding box and its surroundings); and Superpixels Straddling (penalizes the bounding boxes that do not respect the boundaries of the superpixels in the image).


[Figure 2 panels: INPUT: Day Lifelog; Object Candidates Generation; Object Candidates Characterization; Iterative Discovery (Easiest Objects Selection; Clustering & Hard Instances Classification; Refill); OUTPUT: Object Labels & Locations. Legend: paper, hand, tvmonitor, person, cup.]


Object Candidates Characterization: As features to cluster the object candidates, we used a pre-trained CNN (Krizhevsky et al., 2012), which was trained on millions of images and is composed of a succession of convolutional and pooling layers. We removed the last layer, which provides a supervised classification into 1,000 ImageNet classes, and used the output of the penultimate layer as our features (4096 variables). Note that our approach differs from that of (Lee and Grauman, 2011), which used LAB histograms for extracting colour information, Pyramid HOG for extracting shape information, and Spatial Pyramid Matching (Lazebnik et al., 2006) for extracting texture information.


Fig. 2. Ego-Object Discovery methodology scheme. The different algorithms applied in each part of the methodology are represented in orange.


False Objects Filtering: The main drawback of most object detection methods is the huge number of FPs they produce. Given that the objectness score alone is not enough to discard the 'No Object' instances, we filter the object candidates with an RBF-SVM classifier trained on CNN features to distinguish 'Object' vs. 'No Object' instances.
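A filter of this kind could look like the sketch below, assuming scikit-learn's `SVC`. The helper names are hypothetical, and the mapping from an RBF width σ to scikit-learn's `gamma` as 1/(2σ²) is an assumption about the kernel parameterization, not something stated in the text.

```python
import numpy as np
from sklearn.svm import SVC


def train_object_filter(features, is_object, sigma, C):
    """Fit an RBF-SVM separating 'Object' (1) from 'No Object' (0) candidates.

    Assumption: the RBF kernel is exp(-||x - y||^2 / (2 * sigma^2)),
    so scikit-learn's gamma equals 1 / (2 * sigma^2).
    """
    gamma = 1.0 / (2.0 * sigma ** 2)
    return SVC(kernel="rbf", C=C, gamma=gamma).fit(features, is_object)


def filter_candidates(classifier, features):
    """Return the indices of the candidates predicted as 'Object'."""
    return np.flatnonzero(classifier.predict(features) == 1)


# Toy usage with 2-D stand-ins for the 4096-D CNN features.
train_x = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                    [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
train_y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
clf = train_object_filter(train_x, train_y, sigma=2.0, C=3.0)
kept = filter_candidates(clf, np.array([[0.5, 0.5], [10.5, 10.5]]))
```

Only the candidates whose indices are returned by `filter_candidates` would then be passed on to the iterative discovery.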
2.2. Iterative Discovery

Easiest Objects Selection: In order to achieve an iterative easy-first discovery, we use the objectness score associated with each candidate to decide whether a candidate $o_j$ is considered in the current iteration:

\[ \mathrm{objectnessScore}(o_j) > \mu + \omega_1 \sigma - \omega_2 t, \tag{1} \]

where $\mu$ and $\sigma$ are, respectively, the mean and the standard deviation of all scores, $t$ is the current iteration, and $\omega_1$ and $\omega_2$ are weights. This easiness measure seems a promising method for obtaining object candidates in general. However, this technique does not obtain the same results on egocentric datasets as on intentional images, because the images are not captured by a person looking at objects of the world, but are acquired non-intentionally while a person is loosely wearing the camera. Owing to the inherently low frequency of appearance of different real-world objects, the limited image quality of wearable egocentric devices and the constant movement of the user, a great part of the photos are unclear, dark or blurry (see Fig. ?). All this lowers the precision when clustering the obtained object candidates.

Refill Strategy: In order to solve these problems, we define a "refill" methodology as follows: at each iteration, the set of selected easiest samples is completed with a certain percentage (w.r.t. the number of easy samples retrieved) of samples from the Bag of Refill, which are randomly chosen labeled samples distributed over the already discovered object classes. In this way, we address two problems: 1) the difficulty of forming a cluster from a very small set of class instances, and 2) the difficulty of linking samples of the same class that are blurry and unclear. Thus, by refilling the space with more samples of the same class of objects, we can obtain more compact clusters (see Fig. 3 and Fig. 4).

Fig. 3. Clusters formed by the easiest samples.

Fig. 4. Clusters formed by the refilled and easiest samples.

Clustering and Hard Instances Classification: In this step we apply an Agglomerative Ward clustering to the object candidates. Once the clusters are formed, we compute the Silhouette Coefficient (Tan and Steinbach, 2011) of each cluster and select the best one for the user to assign it a label. This coefficient is calculated only on the unlabeled samples, never using the refilled ones for selecting the most reliable cluster. At the end of each iteration, a One-Class SVM for finding harder instances is built with the new cluster, and the rest of the easy samples are classified.

3. Results

In this section, we discuss the three datasets we used (their characteristics are summarized in Table 1) and describe the different tests applied to illustrate the EOD performance.

3.1. Datasets


Due to the low number of publicly available egocentric datasets and the complete lack of egocentric object-labeled datasets, we considered it very important to construct one and make it public, so that it can serve as a basis for comparing algorithms within the egocentric community.


The Egocentric Dataset of the University of Barcelona (EDUB) (see Fig. 5) is a dataset composed of 4912 images acquired by 4 people using the Narrative wearable camera (www.getnarrative.com). It is divided into 8 different days, 2 days per person. The objects appearing in the images were segmented using the online tool LabelMe (Russell et al., 2008) (although here we only use their bounding boxes), and their annotation files are similar to the ones provided by PASCAL. EDUB includes the following classes (the number of samples per class is given in parentheses): 'lamp' (2299), 'tvmonitor' (1274), 'hand' (1232), 'person' (1175), 'glass' (831), 'building' (732), 'face' (565), 'aircon' (530), 'sign' (506), 'cupboard' (392), 'paper' (377), 'car' (315), 'bottle' (260), 'door' (199), 'chair' (179), 'mobilephone' (145), 'window' (138), 'dish' (65), 'motorbike' (64), 'bicycle' (12), and 'train' (4). Note that in our tests we did not use the classes with few instances (i.e., fewer than 100), considering that it would not be possible to discover them with a clustering strategy.
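As a quick check of the class selection, the following sketch applies the 100-instance cut-off to the per-class counts listed above. The variable names are illustrative only; the counts are copied from the text.

```python
# Per-class sample counts of EDUB, as listed in the text.
edub_counts = {
    "lamp": 2299, "tvmonitor": 1274, "hand": 1232, "person": 1175,
    "glass": 831, "building": 732, "face": 565, "aircon": 530,
    "sign": 506, "cupboard": 392, "paper": 377, "car": 315,
    "bottle": 260, "door": 199, "chair": 179, "mobilephone": 145,
    "window": 138, "dish": 65, "motorbike": 64, "bicycle": 12, "train": 4,
}

MIN_INSTANCES = 100  # classes with fewer instances are left out of the tests

kept = {name: n for name, n in edub_counts.items() if n >= MIN_INSTANCES}
dropped = sorted(set(edub_counts) - set(kept))

print(len(kept), sum(kept.values()))  # 17 classes, 11149 GT objects
print(dropped)                        # ['bicycle', 'dish', 'motorbike', 'train']
```

The resulting 17 classes and 11,149 objects agree with the EDUB row of Table 1.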




Fig. 5. Object candidates obtained by Ferrari's objectness detector on the EDUB dataset. From left to right and top to bottom: aircon, bottle, building, car, chair, sign, cupboard, door, face, glass, hand, tvmonitor, lamp, mobilephone, paper, person, window.


The second dataset we considered is PASCAL VOC 2012 (Everingham et al., 2012), one of the most widely used datasets in object detection/recognition research, with very difficult and challenging images. We used the 'trainval' set of images for our tests (in order to have more samples), after removing the images it has in common with the 2007 version. We applied this pre-processing to avoid any bias in the results, since some of the object detection methods used were trained on PASCAL VOC 2007.

Table 1. Image/object characteristics for each of the used datasets.

           images    object candidates    GT objects    classes
MSRC        3,427              171,350         4,217         16
PASCAL     16,369              818,450        38,144         20
EDUB        4,912              245,600        11,149         17


The last dataset we chose is the Microsoft Research Cambridge (MSRC) dataset (Lee and Grauman, 2005), which was also used in (Lee and Grauman, 2011) for object discovery and therefore eases the comparison of the results. Considering that the MSRC dataset is labeled at pixel level, we had to extract the bounding boxes corresponding to each of the objects, making some assumptions: 1) the bounding box of an object is the minimal enclosing box around all the connected pixels that belong to the same class; 2) given that the dataset is split into folders, we only considered valid the objects with the same class as the folder's name; 3) the minimal area for an object to be valid was set to 50x50 image pixels (about 0.81% of the whole image); and 4) we excluded the labels 'grass', 'sky', 'mountain', 'water' and 'road', because they are not objects, but rather environments.
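The four assumptions above can be illustrated with the following sketch, written in plain Python. It is not the authors' extraction code: the function name, the representation of the annotation as a 2-D grid of class names, the use of 4-connectivity, and the reading of the area criterion as "bounding-box area of at least 50 x 50 pixels" are all assumptions made for the example.

```python
from collections import deque

EXCLUDED = {"grass", "sky", "mountain", "water", "road"}  # environments, not objects


def extract_boxes(label_map, folder_class, min_area=50 * 50):
    """Return (x_min, y_min, x_max, y_max) boxes of connected regions of `folder_class`.

    `label_map` is a 2-D list of class names, one per pixel (hypothetical format).
    """
    if folder_class in EXCLUDED:
        return []
    height, width = len(label_map), len(label_map[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or label_map[y][x] != folder_class:
                continue
            # Breadth-first search over the 4-connected region containing (x, y).
            queue = deque([(y, x)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cy, cx = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and label_map[ny][nx] == folder_class):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Keep the minimal enclosing box only if it is large enough.
            if (x1 - x0 + 1) * (y1 - y0 + 1) >= min_area:
                boxes.append((x0, y0, x1, y1))
    return boxes


# Toy 4x5 annotation: one 2x2 'car' region, one isolated 'car' pixel, a 'sky' row.
toy = [
    ["car", "car", "void", "void", "void"],
    ["car", "car", "void", "void", "car"],
    ["void", "void", "void", "void", "void"],
    ["sky", "sky", "sky", "sky", "sky"],
]
print(extract_boxes(toy, "car", min_area=4))  # [(0, 0, 1, 1)]
```

Note that 50 x 50 = 2,500 pixels is about 0.81% of a 640 x 480 image, which is consistent with the percentage quoted in the text.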

Figs. 5 and 6 show some image samples from the 3 datasets. The MSRC dataset should obtain better results than the other two, due to the position of the objects (central in the image) and their clear appearance. Even though PASCAL in general contains some object instances that are very difficult to find, the hardest dataset is EDUB (considering also its high rate of object occlusions, its blurriness and its lower image quality).



Fig. 6. MSRC image samples (top) and PASCAL 12 samples (bottom). 


3.2. Object Detection Methods 


Given that the first step of the algorithm is to obtain object candidates from the images, we tested and compared four different state-of-the-art object detection methods on the three datasets (see Table 2). We chose the Objectness (Alexe et al., 2010), BING (Cheng et al., 2014), Multiscale Combinatorial Grouping (MCG) (Arbelaez et al., 2014) and Selective Search (Uijlings et al., 2013) methods, considering their good performance. For MCG, we applied its quickest, but less exhaustive, version.

Due to the dramatic increase in the space needed to store all the samples^3, we extracted only the top 50 object candidates per image, sorted by their objectness score.
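Keeping the 50 best-scored candidates per image can be sketched as below. The representation of a candidate as a `(box, score)` pair and the function name are assumptions for the example. With 50 candidates per image, the candidate counts of Table 1 follow directly from the number of images.

```python
import heapq


def top_candidates(candidates, k=50):
    """Keep the k candidates with the highest objectness score.

    `candidates` is a list of (box, score) pairs (hypothetical format).
    """
    return heapq.nlargest(k, candidates, key=lambda candidate: candidate[1])


# Toy usage with three candidates and k = 2.
toy = [((0, 0, 10, 10), 0.2), ((5, 5, 20, 20), 0.9), ((1, 1, 4, 4), 0.5)]
best_two = top_candidates(toy, k=2)

# Candidates per dataset = images x 50, matching Table 1.
images = {"MSRC": 3427, "PASCAL": 16369, "EDUB": 4912}
total_candidates = {name: n * 50 for name, n in images.items()}
print(total_candidates)  # {'MSRC': 171350, 'PASCAL': 818450, 'EDUB': 245600}
```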

Analyzing the percentage of NO (see the overlapping score in Section 3.3) and the DR of each method, we can see that the DR is not as high as desired and the percentage of NO is remarkably high. This means that using any of the best state-of-the-art approaches for object detection makes us lose a lot of information, so we have to consider that our final results will inevitably be biased and worsened for this reason.

Table 2. Percentage of 'No Objects' (NO) (or of False Positives) and Detection Rate (DR) comparison of the four object detection methods on our three datasets.

                 Objectness     BING      MCG    Sel. Search
MSRC      NO          91.69    96.68    48.42          61.95
          DR          88.83    64.15    79.61          70.98
PASCAL    NO          92.14    92.93    65.16          71.30
          DR          60.47    56.93    49.36          36.71
EDUB      NO          92.75    95.43    79.17          84.27
          DR          60.45    50.00    49.57          29.09

^3 Considering the PASCAL 12 dataset, we needed nearly 30GB of data to store all the images and features for the tests.

Comparing the different datasets, as one could expect from looking at the images, it is clearly easier for any objectness measure to get good results on the MSRC dataset, while it is considerably more difficult on PASCAL and EDUB, with an extra difficulty for the latter due to the non-intentional acquisition and the less clear images of the wearable cameras.

Given our final goal of discovering the true distribution of object classes and as many individual GT objects as possible, we considered that the objectness measure that obtained the best results for EOD was the one proposed by (Alexe et al., 2010), because we are interested in getting most of the GT objects in the dataset, even if we have to deal with a lot of NO (i.e. noisy or FP) instances.


3.3. Experimental Setup 


In order to validate the methodology, we first leave 50% of the object classes in the unlabeled pool as a test set, since we need to test whether the algorithm is able to discover unseen object classes. From the remaining classes, similarly to (Lee and Grauman, 2011), we separated 40% of the total object candidates to represent the initial knowledge stored in the bag of refill, and used the remaining 60% for testing, too.
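This split can be sketched in plain Python as follows. The function name, the random seed, and the rounding down when the number of classes or candidates does not divide evenly are assumptions made for the example.

```python
import random


def split_protocol(candidate_classes, seed=0):
    """Split candidate indices into (bag of refill, unlabeled test pool, held-out classes).

    Half of the classes are held out entirely (unseen classes); from the
    remaining classes, 40% of the candidates seed the bag of refill and the
    other 60% join the test pool.
    """
    rng = random.Random(seed)
    classes = sorted(set(candidate_classes))
    rng.shuffle(classes)
    held_out = set(classes[:len(classes) // 2])

    known = [i for i, c in enumerate(candidate_classes) if c not in held_out]
    unseen = [i for i, c in enumerate(candidate_classes) if c in held_out]
    rng.shuffle(known)
    n_refill = len(known) * 40 // 100
    refill = known[:n_refill]
    test_pool = known[n_refill:] + unseen
    return refill, test_pool, held_out


# Toy usage: 4 classes with 10 candidates each.
toy_classes = [c for c in ("cup", "hand", "lamp", "door") for _ in range(10)]
refill, test_pool, held_out = split_protocol(toy_classes)
print(len(refill), len(test_pool), len(held_out))  # 8 32 2
```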

In order to decide whether a candidate matches a GT object bounding box, we followed the PASCAL VOC challenge criterion, which uses the Overlapping Score (OS). A window region $o_j$ produced by the object detector is considered a hit on a GT label iff:

\[ OS = \frac{|GT \cap o_j|}{|GT \cup o_j|} > 0.5 \tag{2} \]

Due to the challenging images presented to the object detec¬ 
tor, a very high percentage of samples (more than 92% using 
Ferrari’s objectness) could not be considered objects, and were 
labeled as NO. 

In order to tune the parameters of the SVM filter strategy for each of the datasets, we applied a nested 5-fold cross-validation with 5 test divisions, with a grid of parameters of $\sigma \in \{0.1, 0.5, 3, 10, 100, 1000\}$ and $C \in \{0.1, 0.5, 3, 10, 100, 1000\}$. All the tests were performed for each dataset separately and on a randomly selected fraction of its samples to save computational time. With these tests, we finally found that the best parameters for filtering out as many NO instances as possible while keeping as many 'Object' instances as possible (high sensitivity and high specificity), for both the PASCAL and the MSRC classifiers, were $\sigma = 100$ and $C = 3$. In the labeling step, for simulation purposes, we labeled the best cluster with a majority voting strategy w.r.t. the GT, although this labeling is intended to be made by the camera user him-/herself.
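The structure of a nested cross-validation over this grid can be sketched in plain Python as follows. This is a generic illustration, not the authors' tuning code: `fit_and_score` is a hypothetical callback that trains the RBF-SVM on the given training indices and returns a validation score, and the interpretation of "5 test divisions" as 5 outer folds is an assumption.

```python
import itertools
import random

VALUES = [0.1, 0.5, 3, 10, 100, 1000]
GRID = list(itertools.product(VALUES, VALUES))  # 36 (sigma, C) pairs


def k_folds(indices, k, rng):
    """Shuffle the indices and deal them into k folds of nearly equal size."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv(n_samples, fit_and_score, k=5, seed=0):
    """Return one (best_params, outer_score) pair per outer fold.

    `fit_and_score(train_idx, test_idx, sigma, C)` is a user-supplied
    (hypothetical) callback returning a score where higher is better.
    """
    rng = random.Random(seed)
    outer = k_folds(range(n_samples), k, rng)
    results = []
    for i, test_idx in enumerate(outer):
        train_idx = [j for f, fold in enumerate(outer) if f != i for j in fold]
        inner = k_folds(train_idx, k, rng)

        def inner_score(params):
            sigma, C = params
            total = 0.0
            for v, val_idx in enumerate(inner):
                fit_idx = [j for f, fold in enumerate(inner) if f != v for j in fold]
                total += fit_and_score(fit_idx, val_idx, sigma, C)
            return total / k

        # Model selection uses the inner folds only; the outer fold stays unseen.
        best = max(GRID, key=inner_score)
        results.append((best, fit_and_score(train_idx, test_idx, *best)))
    return results


# Toy usage: a dummy score that peaks at sigma = 100 and C = 3.
dummy = lambda train_idx, test_idx, sigma, C: -abs(sigma - 100) - abs(C - 3)
results = nested_cv(50, dummy)
```

In practice the score returned by the callback would balance sensitivity and specificity, as described above.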

We designed different test settings to evaluate our proposal:

S1: Features of (Lee and Grauman, 2011).

S2: CNN object features.

S3: CNN object features with Refill strategy.

S4: CNN object features concatenated with CNN scene features and Refill strategy.

S5: CNN object features with Refill and SVM filter.

S6: CNN object features with Refill, SVM filter and PCA.

With the first pair of settings, we intend to compare the generalization capabilities of the appearance features from (Lee and Grauman, 2011) against the extracted CNN features. In setting S4, we tested adding context about the scene, and in setting S6, we applied a PCA feature dimensionality reduction and transformation in case there is redundancy in the extracted CNN features.

Fig. 7. Comparison of mean silhouette coefficient (thick lines) and standard deviation (thin lines) for the top 15 clusters on 50 algorithm iterations (high values are better).


3.4. Silhouette Coefficient Comparison

In order to check whether the clusters formed by using CNN features are more robust than the ones formed by using the features from (Lee and Grauman, 2011), we can analyse the mean silhouette coefficient values obtained over several iterations. In Fig. 7 we plot the difference in the silhouette coefficient values obtained by using the two kinds of features. The comparison is applied for the top 15 clusters on the first 50 iterations of the algorithm.

We can immediately see that the average compactness of the clusters and their separation from the other clusters (which is what the Silhouette Coefficient measures) is always higher when using CNN features, which leads to purer clusters and a better labeling.

3.5. F-Measure Comparison on EDUB

To evaluate our approach, we used the F-Measure, because it objectively penalizes the FP and FN objects in each class, that is, it represents a trade-off between the Precision and Recall of the method. At the same time, we want to give the same importance to all classes, and we are interested in finding as many different classes as possible, but always leaving the NO instances aside, without taking them into account in the quality measures. Hence, we applied the average per-class precision and recall defined in (Sokolova and Lapalme, 2009) in order to obtain the average F-Measure:

\[ \text{F-Measure} = \frac{2 \cdot \mathrm{Precision}_M \cdot \mathrm{Recall}_M}{\mathrm{Precision}_M + \mathrm{Recall}_M}, \tag{3} \]

where $\mathrm{Precision}_M$ and $\mathrm{Recall}_M$ are the mean precision and recall of all classes, giving the same weight to all of them.

All measures were averaged over at least 5 executions per setting and for a maximum of 100 algorithm iterations. Using these tests, we compared all settings at the end of the easiest samples discovery (Fig. 8) and on each iteration (Fig. 9).

Fig. 8. Final F-Measure for each setting.

Fig. 9. F-Measure evolution for each different setting.
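The macro-averaged F-Measure of Eq. (3) can be computed as in the following sketch. The function name and the label `"NO"` for 'No Object' samples are assumptions. How NO instances enter the counts is also an assumption: here they are excluded from the set of averaged classes, but a NO sample predicted as a class still counts as a false positive for that class.

```python
def macro_f_measure(y_true, y_pred, ignored=("NO",)):
    """F-Measure built from the class-averaged precision and recall (Eq. 3)."""
    classes = sorted(set(y_true) - set(ignored))
    precisions, recalls = [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    if not classes:
        return 0.0
    precision_m = sum(precisions) / len(classes)
    recall_m = sum(recalls) / len(classes)
    total = precision_m + recall_m
    return 2 * precision_m * recall_m / total if total else 0.0


# Worked example: Precision_M = 7/12, Recall_M = 3/4, so F = 21/32.
truth = ["cup", "cup", "hand", "hand", "NO"]
found = ["cup", "hand", "hand", "hand", "cup"]
print(macro_f_measure(truth, found))  # 0.65625
```

Because every class contributes equally to the averages, a setting that discovers only a few frequent classes obtains a low score even if those classes are labelled well.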


Looking at Fig. 8, we can clearly see that using CNN features outperforms the features of (Lee and Grauman, 2011), indicating that they can form purer clusters and find a wider variety of classes thanks to their better representation. Then, by adding the Refill technique, the EOD method outperforms the one using the CNN features only. The rest of the settings cannot reach the same results as CNN + Refill. Moreover, using the additional CNN features of the whole image just adds noise to the set of features. That is, simply applying the CNN to the bounding box of the object candidate already captures the closest and most relevant object context. Despite the high dimensionality of the CNN features, it seems that applying a PCA dimensionality reduction to the data does not provide any benefit to the object discovery.

Comparing the evolution of the F-Measure through the iterations (Fig. 9), we see that all the settings using CNN features experience a much higher increase in the F-Measure value in just the first 5-10 iterations, meaning that they can find clusters of true objects more quickly than setting S1.

Also, using the CNN features combined with the refill strategy, the results clearly improved from 0.072 to 0.285. This is caused by the discovery of different classes of samples. While using the features of (Lee and Grauman, 2011) we are only able to discover 3 or 4 classes at most, achieving an average F-Measure of 0.072, with setting S3 we can discover instances of more than half of the classes, reaching an F-Measure of nearly 0.29. Although on EDUB setting S5 (CNN + Refill + SVM Filtering) does not seem to obtain F-Measure results as good as those of setting S3, on the other datasets, as we will see, it outperforms or nearly reaches the results of setting S3. Furthermore, it gets a wider variety of object classes.


3.6. F-Measure Comparison on All Datasets 


After having found the best combination of methods and parameters to use, we tested how good the new method was by comparing it with the state-of-the-art method (Lee and Grauman, 2011) on each of the datasets (EDUB, PASCAL 2012 and MSRC). Table 3 summarizes the F-Measure results obtained for each of the datasets and each of the best test settings (averaged over at least 5 tests per setting).


Table 3. F-Measure comparison for the three datasets, the state of the art (Lee and Grauman, 2011) and our best test settings (CNN + Refill and CNN + Refill + Filter).

F-Measure       S1    S3 (ours)    S5 (ours)
MSRC         0.121        0.431        0.410
PASCAL       0.002        0.145        0.179
EDUB         0.072        0.285        0.250
Average      0.065        0.287        0.280


As we can see, either of our best methods (setting S3 or setting S5) clearly outperforms the state-of-the-art features, with improvements ranging from 350% to 9000% depending on the dataset and the setting, and an average improvement of 453% with the best setting.
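As a worked example, the sketch below expresses each of our results as a percentage of the corresponding S1 result, which appears to be how the quoted figures are obtained. It uses the rounded entries of Table 3, so the values it yields can differ somewhat from the figures quoted in the text, which were presumably computed from unrounded results. The names are illustrative only.

```python
# Rounded F-Measure values from Table 3.
f_measure = {
    "MSRC":    {"S1": 0.121, "S3": 0.431, "S5": 0.410},
    "PASCAL":  {"S1": 0.002, "S3": 0.145, "S5": 0.179},
    "EDUB":    {"S1": 0.072, "S3": 0.285, "S5": 0.250},
    "Average": {"S1": 0.065, "S3": 0.287, "S5": 0.280},
}


def ratio_percent(dataset, setting):
    """F-Measure of `setting` as a percentage of the S1 baseline on `dataset`."""
    row = f_measure[dataset]
    return 100.0 * row[setting] / row["S1"]


for dataset in f_measure:
    print(dataset,
          round(ratio_percent(dataset, "S3")),
          round(ratio_percent(dataset, "S5")))
```

With these rounded inputs, the smallest ratio is obtained on MSRC with setting S5 and the largest on PASCAL with setting S5, where the S1 baseline is close to zero.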

Even though the average F-Measure obtained with the SVM
filtering (setting S5) is worse than without it (setting S3),
we must consider that these classifiers were built with samples
from datasets other than the ones under test (1/2 of the PASCAL
samples for the MSRC tests and all MSRC samples for both the
PASCAL and EDUB tests), meaning that the generalization will be
poorer than if we built a general classifier with images from
any of the datasets.

Another important consideration is that, for the MSRC tests,
although the final F-Measure results (after 100 iterations) are
better without the filtering, they were in fact better with the
filtering from the 1st to the 75th iteration. In some cases,
therefore, the filtering can offer better results if we want to
stop the discovery method early.

3.7. Object Discovery Results 


In this section, we analyze the object discovery results in
more general terms. Fig. 10 shows the absolute number of object
instances found by each of the methods compared to the GT and
to the ones found by the Objectness measure (Alexe et al.,
2010), in this case without counting repeated instances of the
same object.

Fig. 10. Objects found by each method compared to the GT and the ones
found by the Objectness measure (Alexe et al., 2010).


As we can see, using the parameters of setting S1 (Lee and
Grauman, 2011), we are only able to find instances from 3
different classes, which explains the very low F-Measure
results seen previously. On the other hand, using either CNN +
Refill (setting S3) or CNN + Refill + Filter (setting S5), we
clearly discover objects from a wider variety of classes, which
also explains the higher resulting F-Measure. Moreover, we get
a wider variety of classes with setting S5 (10 different
classes) than with setting S3 (8 different classes).

If we check the discovery order of the classes in each of
the methods (see Fig. 12), we can see that some classes are
more easily discovered, and repeated over the following
iterations, than others. This is caused not only by the number
of class instances appearing in the dataset, but also by the
previously acquired knowledge (refill), the general method
used, and/or the intra-class variability.
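
The discovery order can be read off a per-iteration log of cluster labels. The sketch below assumes that exactly one cluster is labeled per iteration and that the log is a plain list of class names; both the log format and the function names are hypothetical, and the sample log is toy data rather than results from the experiments.

```python
def first_discovery(labels_per_iteration, ignore=("No Object",)):
    """Map each class to the 1-based iteration at which it was first labeled.

    labels_per_iteration: one class name per iteration (assumed log format).
    ignore: labels that do not count as a discovery.
    """
    first = {}
    for iteration, label in enumerate(labels_per_iteration, start=1):
        if label in ignore:
            continue
        # setdefault keeps the earliest iteration for each class.
        first.setdefault(label, iteration)
    return first


def clusters_per_class(labels_per_iteration):
    """Count how many labeled clusters each class received."""
    counts = {}
    for label in labels_per_iteration:
        counts[label] = counts.get(label, 0) + 1
    return counts


if __name__ == "__main__":
    # Toy log, for illustration only.
    log = ["No Object", "tvmonitor", "face", "tvmonitor", "No Object", "hand"]
    print(first_discovery(log))
    print(clusters_per_class(log))
```

Classes that are discovered early and then repeated show up with a small first-discovery iteration and a large cluster count.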


[Plot: "Classes First Discovery"; x-axis: Iterations, 0 to 100]

Fig. 12. First discovery of the object classes as a function of iterations.

Table 4. Number of clusters found for each class using any of the settings S1, S3 or S5.

Class         S1   S3   S5
No Object     96   71   49
hand           0    1    2
lamp           1    0    3
cupboard       2    3    6
car            0    0    0
glass          0    1    0
chair          0    0    4
face           0    6    5
door           0    0    1
window         0    0    1
tvmonitor      0    8   23
building       0    0    0
paper          1    4    1
person         0    0    5
mobilephone    0    0    0
sign           0    3    0



Fig. 11. Examples of discovered objects for three different subjects (one row each). Better viewed in digital format. 


If we analyze the number of clusters in which each class is
found (see Table 4), we can see that, even with the same
percentage of NO candidates (92.75%), Grauman's features
(setting S1) yield 96% of the clusters labeled as NO, whereas
CNN + Refill (setting S3) yields only 71%. Adding the SVM
filtering (setting S5) reduces this further to 49% of the
clusters, thanks to the dramatic reduction of NO instances in
the pool of unlabeled samples.

In Fig. 13 we can see the evolution of the unique GT instances
discovered by each of the methods over the accumulated
iterations (each data point corresponds to an algorithm
iteration) w.r.t. the F-Measure obtained by the method.

Fig. 13. Percentage of GT object discoveries accumulated on each iteration
w.r.t. the F-Measure obtained.

We can see that Grauman's features seem to cover a wider
variety of object samples than either setting S3 or S5 (about
16% against about 6-7% of the GT samples). This result is
probably directly related to the lower F-Measure obtained. Due
to the lower generalization and representation capabilities of
the set of features used (compared to CNN), the labeled
clusters contain a wider variety of samples and objects, so
more unique object instances are labeled, but with a worse
average result.

In Fig. 11 there are some examples of objects discovered by
our methodology. We can see that it is able to discover
instances of the same classes even when they have a high
intra-class variability (person or hand). Note that some
samples are not yet discovered due to the limited number of
iterations applied (100).

Regarding the complexity of EOD, it is easy to see that
(independently of the length of our feature vectors):

• The objectness score extraction has complexity O(N), N being
the number of images in the dataset;

• The SVM filtering has complexity O(N);

• The sorting of easiest objects is O(N * W log(N * W)), W being
the number of candidates extracted for each image;

• The refill strategy is O(1);

• The CNN feature extraction is O(M), M being the number of
easy objects in the current iteration;

• The clustering of easy objects is O(M^);

• The best cluster labeling is O(1);

• The one-class SVM cost is O(M).

This leads to a total cost of O(N * W log(N * W) + M^) for each
iteration.
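
The sorting term in the cost above comes from ranking all N * W candidates before keeping the M easiest ones. A minimal sketch of that step follows; it assumes that "easiness" is given by a single objectness score per candidate, and the function name and data layout are hypothetical rather than taken from the released code.

```python
def select_easiest(scores, m):
    """Return the (image_index, candidate_index) pairs of the m best candidates.

    scores: list of N lists, each holding W objectness scores (assumed layout).
    m: number of easy objects to keep for the current iteration.
    """
    # Flatten the N x W scores into one list of N * W entries.
    flat = [
        (score, image_idx, cand_idx)
        for image_idx, row in enumerate(scores)
        for cand_idx, score in enumerate(row)
    ]
    # Comparison sort over N * W entries: O(N * W log(N * W)).
    flat.sort(key=lambda entry: entry[0], reverse=True)
    return [(image_idx, cand_idx) for _, image_idx, cand_idx in flat[:m]]


if __name__ == "__main__":
    # Toy scores for N = 2 images and W = 3 candidates per image.
    toy_scores = [[0.9, 0.1, 0.5], [0.4, 0.8, 0.2]]
    print(select_easiest(toy_scores, 3))
```

Only the M selected candidates are passed on to the feature extraction and clustering steps, which is why those costs depend on M rather than on N * W.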

4. Conclusions 

In this paper, we proposed a novel semi-supervised object
discovery algorithm for egocentric data that relies on features
extracted from a pre-trained CNN and uses a refill strategy to
find more easily the classes with fewer samples. Moreover, we
added an SVM filtering strategy to discard a large part of the
many 'No Object' candidates produced by any of the objectness
measures. We compared 4 state-of-the-art objectness measures in
terms of the 'No Object' instances produced and the Detection
Rate obtained when extracting a low number of object candidates
(W=50). We showed that the CNN features, the refill strategy
(and the SVM filtering) produce much better F-Measure results
and discover a larger number of infrequent classes than the
state-of-the-art approach on three datasets (MSRC, PASCAL 12
and EDUB), ranging from general, easy images to egocentric and
very difficult ones. Furthermore, we showed that this combined
strategy also works better than the previous ones for very
noisy and blurry images.

5. Future Work 

Our future work involves the following tasks:

1. Define an algorithm to discover objects, scenes and people
that characterize the environment of the person wearing the
camera,

2. Propose an iterative and combined scene and object
discovery that takes advantage of the samples discovered from
the complementary categories, and

3. Make the method discriminative, i.e. able to detect which
objects and scenes characterize the environment of a person
and to distinguish them from those of other people.